[ntuple] Add partioning to RNTupleJoinTable#17919
[ntuple] Add partioning to RNTupleJoinTable#17919enirolf merged 4 commits intoroot-project:masterfrom
RNTupleJoinTable#17919Conversation
6834e5a to
7ebbde6
Compare
Test Results 20 files 20 suites 5d 0h 3m 12s ⏱️ For more details on these failures, see this check. Results for commit 14daddb. ♻️ This comment has been updated with latest results. |
7ebbde6 to
4cc1604
Compare
vepadulano
left a comment
There was a problem hiding this comment.
Nice improvement! Some minor comments for now
4cc1604 to
8e2228d
Compare
silverweed
left a comment
There was a problem hiding this comment.
A couple of minor comments
8e2228d to
cdcfb92
Compare
hahnjo
left a comment
There was a problem hiding this comment.
I left many comments. The answer to some of them may well be that it's motivated by future changes. In that case please feel free to simply resolve the thread(s).
cdcfb92 to
130714e
Compare
1e7f395 to
3a3fa99
Compare
hahnjo
left a comment
There was a problem hiding this comment.
LGTM in principle, thanks for making all the changes. I have one final question about the semantics of GetPartitionedEntryIndexes, and what it should return in case there is no index in a partition.
With this addition, the join table can be split up into several mappings from join values to entry numbers, according to some (numeric) partition key. It has several use cases, but the immediate one is that with this approach, the join table is not restricted to one page source anymore. This is useful for the integration into the `RNTupleProcessor`, where we want to be able to create joins of chains of RNTuples and have to deal with more than one page source. As a side-effect, the state management of the join table and the notion of lazy building has changed. There is no single `isBuilt` state anymore, and the `Add` function eagerly builds the mapping for the provided page source and adds it to the join table. As such, the responsibility of deciding whether to eagerly or lazily build the join table is moved to the application using the join table (i.e. by strategically calling `Add`).
Since it is declared within `RNTupleJoinTable`, it is clear that this type belongs to RNTuple from the context.
3a3fa99 to
14daddb
Compare
vepadulano
left a comment
There was a problem hiding this comment.
Nice improvements, thanks!
With this addition, the
RNTupleJoinTablecan be split up into several mappings from join values to entry numbers, according to some (numeric) partition key. Using a partition key is optional; a default partition key is used when none has been specified by the user.The
RNTupleJoinTablenow has an inner class,REntryMapping, which in practice now represents what theRNTupleJoinTablerepresented previously, i.e., a mapping from join field values to entry numbers:The join table itself now instead provides a mapping from partition keys to (a collection of) these entry mappings:
The reason one partition can contain multiple entry mappings is twofold:
The most immediate use case of the partitioning approach added in this PR, is that this way, the
RNTupleJoinTableitself is not restricted to one page source anymore (this restriction is now inREntryMappinginstead). This is useful for the integration into theRNTupleProcessor, where we want to be able to create joins of chains of RNTuples and have to deal with more than one page source (see also #17132).As a side-effect, the state management of the join table and the notion of lazy building has changed. There is no single
isBuiltstate anymore, and theAddfunction eagerly builds the mapping for the provided page source and adds it to the join table. As such, the responsibility of deciding whether to eagerly or lazily build the join table is moved to the application using the join table (i.e. by strategically callingAdd).